Section 9 - Get group-wise insights: group_by
Unite and conquer using group_by
group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.
In this exercise, you are going to create an ordered per-carrier summary of hflights by combining group_by(), summarize() and arrange().
library(hflights)
hflights_df <- hflights[sample(nrow(hflights), 720), ]
hflights <- as_tibble(hflights)
# Rename Carrier with long names
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
hflights$UniqueCarrier<- lut[hflights$UniqueCarrier]
# Make an ordered per-carrier summary of hflights
hflights %>%
group_by(UniqueCarrier) %>%
summarize(
p_canc = 100 * mean(Cancelled == 1),
avg_delay = mean(ArrDelay, na.rm = TRUE)
) %>%
arrange(avg_delay, p_canc)## # A tibble: 15 x 3
## UniqueCarrier p_canc avg_delay
## <chr> <dbl> <dbl>
## 1 US_Airways 1.13 -0.631
## 2 American 1.85 0.892
## 3 AirTran 0.982 1.85
## 4 Alaska 0 3.19
## 5 Mesa 1.27 4.01
## 6 Delta 1.59 6.08
## 7 Continental 0.678 6.10
## 8 American_Eagle 2.90 7.15
## 9 Atlantic_Southeast 3.45 7.26
## 10 Southwest 1.55 7.59
## 11 Frontier 0.716 7.67
## 12 ExpressJet 1.55 8.19
## 13 SkyWest 1.39 8.69
## 14 JetBlue 2.59 9.86
## 15 United 1.64 10.5
Combine group_by with mutate
You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.
rank(c(21, 22, 24, 23))
has output
[1] 1 2 4 3
As with arrange(), rank() ranks values from the smallest to the largest.
# Ordered overview of average arrival delays per carrier
hflights %>%
filter(!is.na(ArrDelay), ArrDelay > 0) %>%
group_by(UniqueCarrier) %>%
summarize(avg = mean(ArrDelay))%>%
mutate(rank = rank(avg)) %>%
arrange(rank)## # A tibble: 15 x 3
## UniqueCarrier avg rank
## <chr> <dbl> <dbl>
## 1 Mesa 18.7 1
## 2 Frontier 18.7 2
## 3 US_Airways 20.7 3
## 4 Continental 22.1 4
## 5 Alaska 22.9 5
## 6 SkyWest 24.1 6
## 7 ExpressJet 24.2 7
## 8 Southwest 25.3 8
## 9 AirTran 27.9 9
## 10 American 28.5 10
## 11 Delta 32.1 11
## 12 United 32.5 12
## 13 American_Eagle 38.8 13
## 14 Atlantic_Southeast 40.2 14
## 15 JetBlue 45.5 15
Advanced group_by exercises
By now you’ve learned the fundamentals of dplyr: the five data manipulation verbs and the additional group_by() function to discover interesting group-wise statistics. The next challenges are an all-encompassing review of the concepts you have learned about. We already provided you with a template of the piped call that can solve the exercises. Up to you to finish all dplyr calls! For simplicity, you can include cancelled flights in your answers, so you shouldn’t filter based on the Cancelled column.
# How many airplanes only flew to one destination?
hflights %>%
group_by(TailNum) %>%
summarize(ndest = n_distinct(Dest)) %>%
filter(ndest == 1) %>%
summarize(nplanes = n())## # A tibble: 1 x 1
## nplanes
## <int>
## 1 1526
# Find the most visited destination for each carrier
hflights %>%
group_by(UniqueCarrier, Dest) %>%
summarize(n = n()) %>%
mutate(rank = rank(desc(n))) %>%
filter(rank == 1)## # A tibble: 15 x 4
## # Groups: UniqueCarrier [15]
## UniqueCarrier Dest n rank
## <chr> <chr> <int> <dbl>
## 1 AirTran ATL 2029 1
## 2 Alaska SEA 365 1
## 3 American DFW 2105 1
## 4 American_Eagle DFW 2424 1
## 5 Atlantic_Southeast DTW 851 1
## 6 Continental EWR 3924 1
## 7 Delta ATL 2396 1
## 8 ExpressJet CRP 3175 1
## 9 Frontier DEN 837 1
## 10 JetBlue JFK 695 1
## 11 Mesa CLT 71 1
## 12 SkyWest COS 1335 1
## 13 Southwest DAL 8243 1
## 14 United SFO 643 1
## 15 US_Airways CLT 2212 1
Section 10 - dplyr and databases
dplyr deals with different types
hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:
library(data.table)##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
hflights2 <- as.data.table(hflights)hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.
Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.
# Use summarize to calculate n_carrier
hflights2 %>%
summarize(n_carrier = n_distinct(UniqueCarrier))## n_carrier
## 1 15
dplyr and mySQL databases
DataCamp hosts a mySQL database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we already created a reference to a table in this information.
Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.
library(RMySQL)## Loading required package: DBI
library(dbplyr)##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr",
host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com",
port = 3306,
user = "student",
password = "datacamp")
# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")
# glimpse at nycflights
glimpse(nycflights)## Observations: ??
## Variables: 17
## Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
## $ minute <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...
# Ordered, grouped summary of nycflights
nycflights %>%
group_by(carrier) %>%
summarize(n_flights = n(),
avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
arrange(avg_delay)## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source: lazy query [?? x 3]
## # Database: mysql 5.6.34-log
## # [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
## carrier n_flights avg_delay
## <chr> <dbl> <dbl>
## 1 AS 714 -9.86
## 2 HA 342 -6.92
## 3 AA 32729 0.356
## 4 DL 48110 1.63
## 5 VX 5162 1.75
## 6 US 20536 2.06
## 7 UA 58665 3.50
## 8 9E 18460 6.91
## 9 B6 54635 9.36
## 10 WN 12275 9.47
## # ... with more rows
Session info
sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dbplyr_1.2.2 RMySQL_0.10.16 DBI_1.0.0
## [4] data.table_1.12.0 hflights_0.1 ggplot2_3.1.0
## [7] dplyr_0.8.0.1 gapminder_0.3.0 kableExtra_1.0.1
## [10] knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 plyr_1.8.4 pillar_1.3.1
## [4] compiler_3.5.2 prettydoc_0.2.1 tools_3.5.2
## [7] digest_0.6.18 gtable_0.2.0 evaluate_0.12
## [10] tibble_2.0.1 viridisLite_0.3.0 pkgconfig_2.0.2
## [13] rlang_0.3.1 cli_1.0.1 rstudioapi_0.9.0
## [16] yaml_2.2.0 xfun_0.4 withr_2.1.2
## [19] httr_1.4.0 stringr_1.4.0 xml2_1.2.0
## [22] hms_0.4.2 webshot_0.5.1 grid_3.5.2
## [25] tidyselect_0.2.5 glue_1.3.0 R6_2.4.0
## [28] fansi_0.4.0 rmarkdown_1.11 readr_1.3.1
## [31] purrr_0.3.0 magrittr_1.5 scales_1.0.0
## [34] htmltools_0.3.6 assertthat_0.2.0 rvest_0.3.2
## [37] colorspace_1.4-0 utf8_1.1.4 stringi_1.3.1
## [40] lazyeval_0.2.1 munsell_0.5.0 crayon_1.3.4